Introduction

Accurately testing for marijuana usage is very important in different everyday scenarios. For example, THC in marijuana can affect an individuals motor skills, depth perception, and overall cognition. This then can hinder their ability to work effectively and safely. According to the National Institute on Drug Abuse, it is noted that employees that tested positive for marijuana had an increase of 55% when it came to workplace accidents and that they were responsible for another increase of 85% of work-related injuries 1. These liabilities can hurt the company and more importantly, the individual; hence why it is paramount for companies to run effective drug tests on their employees. Another example why finding out which compound and matrix is most effective to use when conducting a drug test is for scenarios in which we’d like to find out if a driver is under the influence or not. Quoted directly from the National Highway Traffic Safety Administration, “In the 2013-2014 survey 2, 12.6 percent of weekend nighttime drivers tested positive for marijuana. That’s a 48-percent increase in less than 10 years” 3.

For our case study, the primary focus is to figure out which compound from which matrix (whole blood, breath, oral fluid) is the best bio-marker for recent use, with “recent use” definied to be within the last 3 hours.

Load packages

library(tidymodels)
library(tidyverse)
library(dplyr)
library(ggplot2)
library(janitor)
library(purrr)
library(rstatix)
library(cowplot)

Question

  • Which compound, in which matrix, and at what cutoff is the best bio-marker of recent use? (within 3h)
  • what’s the optimal model for the relationship between specific compounds?

The Data

The data we will be using comes from a placebo-controlled, double-blinded, randomized study published by Hoffman et. al. (2021). In the study, volunteers were randomly assigned to smoke a ciggertte containing placebo (0.02%), or 5.9% (low dose) or 13.4% (high dose) THC. Samples of whole blood (WB), oral fluid(OF), breath (BR) are taken before and after smoking.

Data Import

WB = read.csv("data/Blood.csv")
BR = read.csv("data/Breath.csv")
OF = read.csv("data/OF.csv")

Data Wrangling

We re-coded and re-leveled variables (Treatment and Group), cleans column names, and renames specific columns (x11_oh_thc to thcoh, thc_v to thcv, thccooh_gluc to thc_cooh_gluc, and thccooh to thc_cooh) in all 3 tables. Using janitor package to organized column names.

OF <- OF |>
  mutate(Treatment = fct_recode(Treatment, 
                                "5.9% THC (low dose)" = "5.90%",
                                "13.4% THC (high dose)" = "13.40%"),
         Treatment = fct_relevel(Treatment, "Placebo", "5.9% THC (low dose)"),
         Group = fct_recode(Group, 
                            "Occasional user" = "Not experienced user",
                            "Frequent user" = "Experienced user" )) |>  
  janitor::clean_names() |>
  rename(thcoh = x11_oh_thc,
         thcv = thc_v)

WB <- WB |> 
  mutate(Treatment = fct_recode(Treatment, 
                                "5.9% THC (low dose)" = "5.90%",
                                "13.4% THC (high dose)" = "13.40%"),
         Treatment = fct_relevel(Treatment, "Placebo", "5.9% THC (low dose)")) |> 
  janitor::clean_names() |>
  rename(fluid = fluid_type,
         thcoh = x11_oh_thc,
         thccooh = thc_cooh,
         thccooh_gluc = thc_cooh_gluc,
         thcv = thc_v)

BR <- BR |> 
  mutate(Treatment = fct_recode(Treatment, 
                                "5.9% THC (low dose)" = "5.90%",
                                "13.4% THC (high dose)" = "13.40%"),
         Treatment = fct_relevel(Treatment, "Placebo", "5.9% THC (low dose)"),
         Group = fct_recode(Group, 
                            "Occasional user" = "Not experienced user",
                            "Frequent user" = "Experienced user" )) |> 
  janitor::clean_names() |> 
  rename(thc = thc_pg_pad)


compounds_WB <-  as.list(colnames(Filter(function(x) !all(is.na(x)), WB[6:13])))
compounds_BR <-  as.list(colnames(Filter(function(x) !all(is.na(x)), BR[6])))
compounds_OF <-  as.list(colnames(Filter(function(x) !all(is.na(x)), OF[6:12])))

Then we created 3 tables based on specific minutes and labeled accordingly, covering pre-smoking and subsequent post-smoking time periods for blood, breath, and oral fluid data.

timepoints_WB <- tibble(
  start = c(-400, 0, 30, 70, 100, 180, 210, 240, 270, 300),
  stop = c(
    0,
    30,
    70,
    100,
    180,
    210,
    240,
    270,
    300,
    max(WB$time_from_start, na.rm = TRUE)
  ),
  timepoint = c(
    "pre-smoking",
    "0-30 min",
    "31-70 min",
    "71-100 min",
    "101-180 min",
    "181-210 min",
    "211-240 min",
    "241-270 min",
    "271-300 min",
    "301+ min"
  )
)

timepoints_BR <- tibble(
  start = c(-400, 0, 40, 90, 180, 210, 240, 270),
  stop = c(
    0,
    40,
    90,
    180,
    210,
    240,
    270,
    max(BR$time_from_start, na.rm = TRUE)
  ),
  timepoint = c(
    "pre-smoking",
    "0-40 min",
    "41-90 min",
    "91-180 min",
    "181-210 min",
    "211-240 min",
    "241-270 min",
    "271+ min"
  )
)

timepoints_OF <- tibble(
  start = c(-400, 0, 30, 90, 180, 210, 240, 270),
  stop = c(0, 30, 90, 180, 210, 240, 270,
           max(OF$time_from_start, na.rm = TRUE)),
  timepoint = c(
    "pre-smoking",
    "0-30 min",
    "31-90 min",
    "91-180 min",
    "181-210 min",
    "211-240 min",
    "241-270 min",
    "271+ min"
  )
)

assign_timepoint <- function(x, timepoints) {
  if (!is.na(x)) {
    timepoints$timepoint[x > timepoints$start & x <= timepoints$stop]
  } else{
    NA
  }
}

We created a new column, timepoint_use, in each table by mapping the time_from_start values to specific timepoints defined in separate reference data frames (timepoints_WB, timepoints_OF, timepoints_BR). Finally, re-leveled the timepoint_use factor variable to align with the order specified in the reference data frames. This ensures consistent and meaningful timepoint labels for subsequent analyses or visualizations in the study.

 WB <- WB |> 
  mutate(timepoint_use = map_chr(time_from_start, 
                                 assign_timepoint, 
                                 timepoints=timepoints_WB),
         timepoint_use = fct_relevel(timepoint_use, timepoints_WB$timepoint))

OF <- OF |> 
  mutate(timepoint_use = map_chr(time_from_start, 
                                 assign_timepoint, 
                                 timepoints=timepoints_OF),
         timepoint_use = fct_relevel(timepoint_use, timepoints_OF$timepoint))

BR <- BR |> 
  mutate(timepoint_use = map_chr(time_from_start, 
                                 assign_timepoint, 
                                 timepoints=timepoints_BR),
         timepoint_use = fct_relevel(timepoint_use, timepoints_BR$timepoint))

remove duplicate id

WB <- drop_dups(WB)
OF <- drop_dups(OF)
BR <- drop_dups(BR)


#im saving a copy of this clean OF for the extended question section. 
OF_ex = drop_dups(OF)

EDA

compounds measurements over time by treatment

The following plots include of all compounds against time, distinguished by color according to their respective groups. To achieve a comprehensive understanding, we generated scatterplots for compounds across three distinct matrices—namely, whole blood, oral fluid, and breath. This analysis encompasses various timepoints and considers different treatments, namely, placebo, low dose, and high dose.

Upon close examination of the scatterplots, a noteworthy observation emerges, particularly concerning the THC biomarker in whole blood. This specific biomarker appears to offer a potentially enhanced indication of recent cannabis joint usage. The scatterplot reveals a discernible separation between the placebo and THC treatment groups, suggesting that the THC measurement in whole blood may serve as a more reliable indicator of recent cannabis joint consumption.

scatter_WB <- map(compounds_WB, ~ compound_scatterplot_group_by_treatment( 
    dataset=WB, 
    compound=.x, 
    timepoints=timepoints_WB))

scatter_OF <- map(compounds_OF, ~ compound_scatterplot_group_by_treatment( 
    dataset=OF, 
    compound=.x, 
    timepoints=timepoints_OF))

scatter_BR <- map(compounds_BR, ~ compound_scatterplot_group_by_treatment( 
    dataset=BR, 
    compound=.x, 
    timepoints=timepoints_BR))

In the presented set of scatterplots, all compounds are graphically depicted against time, with color distinctions denoting different treatment conditions and a log transformation applied to the y-axis, which represents the respective compound measurements. A comparative analysis with the previous scatterplots reveals a modification: specifically, a log transformation has been applied to the y-axis, providing an alternative perspective on the measurement of the compounds.

Upon closer examination, a notable observation emerges. The measurement of THC from breath exhibits a more discernible separation between the placebo and THC treatment groups in the log-transformed scatterplots. This suggests that the log transformation on the y-axis enhances the visibility of distinctions between the treatment conditions for THC. The log transformation, by compressing the scale, may unveil nuances and patterns that are not as apparent on a linear scale. This nuanced insight into THC measurements underscores the importance of considering the impact of transformation techniques when analyzing compound data over time in the context of different treatments. The enhanced separation observed in the log-transformed scatterplots could potentially provide valuable insights into the effects of treatments on THC levels and underscores the sensitivity of the chosen visualization approach.

scatter_WB_by_treatment <- map(compounds_WB, ~ compound_scatterplot_group_by_treatment_log( 
    dataset=WB, 
    compound=.x, 
    timepoints=timepoints_WB))

scatter_OF_by_treatment <- map(compounds_OF, ~ compound_scatterplot_group_by_treatment_log( 
    dataset=OF, 
    compound=.x, 
    timepoints=timepoints_OF))

scatter_BR_by_treatment <- map(compounds_BR, ~ compound_scatterplot_group_by_treatment_log( 
    dataset=BR, 
    compound=.x, 
    timepoints=timepoints_BR))

We can see that some compounds’ measurement are clearly not useful. Here we remove them from all future analysis. The compounds removed include: WB: cbd, thccooh, thccooh_gluc, thcv OF: thcoh

compounds_WB = compounds_WB[- c(2, 5, 6, 8)]
compounds_OF = compounds_OF[- c(4)]

Analysis

Calculating sensitivity and specificity.

output_WB <- map_dfr(compounds_WB,
                     ~ sens_spec_cpd(
                       dataset = WB,
                       cpd = all_of(.x),
                       timepoints =  timepoints_WB
                     )) |> clean_gluc()

output_BR <- map_dfr(compounds_BR, 
                     ~ sens_spec_cpd(
                       dataset = BR,
                       cpd = all_of(.x),
                       timepoints = timepoints_BR
                     ))  |> clean_gluc()

output_OF <- map_dfr(compounds_OF,
                     ~ sens_spec_cpd(
                       dataset = OF,
                       cpd = all_of(.x),
                       timepoints = timepoints_OF
                     ))  |> clean_gluc()

output_WB
## # A tibble: 2,470 × 17
##       TP    FN    FP    TN detection_limit compound time_start time_stop
##    <dbl> <dbl> <int> <int>           <dbl> <chr>         <dbl>     <dbl>
##  1     0     0     1   188           0.5   CBN            -400         0
##  2     0     0     0   189           0.6   CBN            -400         0
##  3     0     0     0   189           0.7   CBN            -400         0
##  4     0     0     0   189           0.753 CBN            -400         0
##  5     0     0     0   189           0.8   CBN            -400         0
##  6     0     0     0   189           0.831 CBN            -400         0
##  7     0     0     0   189           0.9   CBN            -400         0
##  8     0     0     0   189           0.909 CBN            -400         0
##  9     0     0     0   189           1     CBN            -400         0
## 10     0     0     0   189           1.1   CBN            -400         0
## # ℹ 2,460 more rows
## # ℹ 9 more variables: time_window <chr>, NAs <int>, N <int>, N_removed <int>,
## #   Sensitivity <dbl>, Specificity <dbl>, PPV <dbl>, NPV <dbl>,
## #   Efficiency <dbl>
output_BR
## # A tibble: 816 × 17
##       TP    FN    FP    TN detection_limit compound time_start time_stop
##    <dbl> <dbl> <int> <int>           <dbl> <chr>         <dbl>     <dbl>
##  1     0     0     6   185             0.5 THC            -400         0
##  2     0     0     6   185            80.1 THC            -400         0
##  3     0     0     6   185            85.6 THC            -400         0
##  4     0     0     6   185            86.4 THC            -400         0
##  5     0     0     6   185            91.8 THC            -400         0
##  6     0     0     6   185            93.8 THC            -400         0
##  7     0     0     6   185            98.2 THC            -400         0
##  8     0     0     6   185           105.  THC            -400         0
##  9     0     0     6   185           105.  THC            -400         0
## 10     0     0     6   185           107.  THC            -400         0
## # ℹ 806 more rows
## # ℹ 9 more variables: time_window <chr>, NAs <int>, N <int>, N_removed <int>,
## #   Sensitivity <dbl>, Specificity <dbl>, PPV <dbl>, NPV <dbl>,
## #   Efficiency <dbl>
output_OF
## # A tibble: 3,928 × 17
##       TP    FN    FP    TN detection_limit compound time_start time_stop
##    <dbl> <dbl> <int> <int>           <dbl> <chr>         <dbl>     <dbl>
##  1     0     0     5   187            0.5  CBN            -400         0
##  2     0     0     8   184            0.4  CBN            -400         0
##  3     0     0     5   187            0.48 CBN            -400         0
##  4     0     0     3   189            0.6  CBN            -400         0
##  5     0     0     3   189            0.7  CBN            -400         0
##  6     0     0     3   189            0.8  CBN            -400         0
##  7     0     0     3   189            0.9  CBN            -400         0
##  8     0     0     1   191            1    CBN            -400         0
##  9     0     0     1   191            1.1  CBN            -400         0
## 10     0     0     1   191            1.16 CBN            -400         0
## # ℹ 3,918 more rows
## # ℹ 9 more variables: time_window <chr>, NAs <int>, N <int>, N_removed <int>,
## #   Sensitivity <dbl>, Specificity <dbl>, PPV <dbl>, NPV <dbl>,
## #   Efficiency <dbl>

cutoff vs. sensitivity/specificity

Here we plot the value of the cutoff against sensitivity and specificity for every compound in every matrix, and arrange them all into one big plot. This is also known as the ROC curve of sensitivity and specificity against cutoff values suggests an exploration of optimal cutoff points. Overall, the specificity of all compounds increases when detection limit rises. On the other hand, sensitivity drops to zero when detection limit rises.

#arranges ss plots into one
ss_bottom_row <-
  plot_grid(
    ss_OF,
    ss_BR,
    labels = c('B', 'C'),
    label_size = 12,
    ncol = 2,
    rel_widths = c(0.66, .33)
  )
plot_grid(
  ss_WB,
  ss_bottom_row,
  labels = c('A', ''),
  label_size = 12,
  ncol = 1
)

Average sensitivity and specificity vs. detection limit

output_WB_avg = average_sens_spec(output = output_WB)
output_OF_avg = average_sens_spec(output = output_OF)
output_BR_avg = average_sens_spec(output = output_BR)


ss_WB_avg_together <-
  ss_plot_avg_together(output_WB_avg, tpts = length(unique(output_WB$time_start)), tissue = "Blood")

ss_OF_avg_together <-
  ss_plot_avg_together(output_OF_avg, tpts = length(unique(output_WB$time_start)), tissue = "Oral Fluid")

ss_BR_avg_together <-
  ss_plot_avg_together(output_BR_avg, tpts = length(unique(output_WB$time_start)), tissue = "Breath")

It should be apparent that OF-THC is the superior choice. now we dig deeper into OF-THC and find the specific cutoff. referring back to the Average sensitivity and specificity vs. detection limit plot, we see that the detection limit is at…very close to 0 when both sensitivity and specificity are high. Let’s try out some more cutoffs close to 0.

i will now remove every compound where the average sens and spec does not intersect. reasoning: for compounds with no intersection, optimal sensitivity (left most point of the graph) = worst specificity. there is no room for adjustment because any adjustment from there on would just make everything worse.

compounds_WB = c("thc")
compounds_OF = c("thc")
compounds_BR = NULL

sensitivity vs. specificity

In this visual representation, we graph the sensitivity against specificity for each compound within every matrix, consolidating the data into a comprehensive plot. This collective visualization allows for a convenient comparison of the performance of various biomarkers concerning their specificity and sensitivity.

output_WB <- map_dfr(compounds_WB,
                     ~ sens_spec_cpd(
                       dataset = WB,
                       cpd = all_of(.x),
                       timepoints =  timepoints_WB
                     )) |> clean_gluc()


output_OF <- map_dfr(compounds_OF,
                     ~ sens_spec_cpd(
                       dataset = OF,
                       cpd = all_of(.x),
                       timepoints = timepoints_OF
                     ))  |> clean_gluc()

#plot sensitivity vs. specificity
roc_WB = roc_plot(output_WB, tpts = length(unique(output_WB$time_start)), tissue = "Blood")

roc_OF = roc_plot(output_OF, tpts = length(unique(output_OF$time_start)), tissue = "Oral Fluid")

# #arrange roc plots
# roc_bottom_row <-
#   plot_grid(
#     roc_OF,
#     roc_BR,
#     labels = c('B', 'C'),
#     label_size = 12,
#     ncol = 2,
#     rel_widths = c(0.66, .33)
#   )
# plot_grid(
#   roc_WB,
#   roc_bottom_row,
#   labels = c('A', ''),
#   label_size = 12,
#   ncol = 1
# )

plot sensitivity and speciticity over time given specific cutoffs

It should be apparent that OF-THC is the superior choice. now we dig deeper into OF-THC and find the specific cutoff. referring back to the Average sensitivity and specificity vs. detection limit plot, we see that the detection limit is at…very close to 0 when both sensitivity and specificity are high. Let’s try out some more cutoffs close to 0.

plot sensitivity and specificity over time given specific cutoffs

Taking a deeper dive into sensitivity and specificity over time over time for the measurement of THC in Blood and Oral Fluid tissues. In direct comparison between the two measurement methods of THC, it becomes evident that Oral Fluid outshines its counterpart in terms of both sensitivity and specificity, particularly within the critical time span of three hours post-smoking.

#pass specific cutoff into splits parameter
OF_THC <- sens_spec_cpd(
  dataset = OF,
  cpd = 'thc',
  timepoints = timepoints_OF,
  splits =  c(0.5, 1, 2, 5, 10)
) |> clean_gluc()

of_levels <- c("pre-smoking\nN=192", "0-30\nmin\nN=192", "31-90\nmin\nN=117",
               "91-180\nmin\nN=99", "181-210\nmin\nN=102", "211-240\nmin\nN=83",
               "241-270\nmin\nN=90",  "271+\nmin\nN=76")

plot_cutoffs(dataset=OF_THC, 
             timepoint_use_variable=OF$timepoint_use, 
             tissue="Oral Fluid", 
             cpd="THC", 
             x_labels=NULL)
## [[1]]

## 
## [[2]]
## # A tibble: 40 × 18
##       TP    FN    FP    TN detection_limit compound time_start time_stop
##    <dbl> <dbl> <int> <int> <fct>           <chr>         <dbl>     <dbl>
##  1     0     0    35   157 0.5             THC            -400         0
##  2     0     0    20   172 1               THC            -400         0
##  3     0     0     9   183 2               THC            -400         0
##  4     0     0     0   192 5               THC            -400         0
##  5     0     0     0   192 10              THC            -400         0
##  6   129     0    39    24 0.5             THC               0        30
##  7   129     0    30    33 1               THC               0        30
##  8   128     1    19    44 2               THC               0        30
##  9   128     1     3    60 5               THC               0        30
## 10   125     4     1    62 10              THC               0        30
## # ℹ 30 more rows
## # ℹ 10 more variables: time_window <fct>, NAs <int>, N <int>, N_removed <int>,
## #   Sensitivity <dbl>, Specificity <dbl>, PPV <dbl>, NPV <dbl>,
## #   Efficiency <dbl>, my_label <fct>

the average sensitivity is a lot more sensitive (ha) to change than the average specificity - specificity only dips in the 31-90min window when the cutoff is lowered, whereas a lower cutoff increases overall sensitivity all across the board, no matter the time.

the average sensitivity is a lot more sensitive (ha) to change than the average specificity - specificity only dips in the 31-90min window when the cutoff is lowered, whereas a lower cutoff increases overall sensitivity all across the board, no matter the time. additionally, this 31-90min window where the specificity is heavily effected by a low cutoff is, in my opinion, trivial. it should be quite apparent that someone is high if they smoked within the last 90 min. a lowered specificity in this time frame isnt cause for too much concern, considering how much sensitivity is gained via using a low cutoff.

in a nutshell: a low cutoff is optimal. approxiamately somewhere between 0-2. let’s test more cutoffs in this range:

OF_THC <- sens_spec_cpd(
  dataset = OF,
  cpd = 'thc',
  timepoints = timepoints_OF,
  splits =  c(0.1, 0.25, 0.5, 1, 1.5)
) |> clean_gluc()

blood_levels <- c("pre-smoking\nN=189", "0-30\nmin\nN=187", "31-70\nmin\nN=165",
                  "71-100\nmin\nN=157", "101-180\nmin\nN=168", "181-210\nmin\nN=103",
                  "211-240\nmin\nN=127", "241-270\nmin\nN=137", "271-300\nmin\nN=120",
                  "301+\nmin\nN=88")

of_levels <- c("pre-smoking\nN=192", "0-30\nmin\nN=192", "31-90\nmin\nN=117",
               "91-180\nmin\nN=99", "181-210\nmin\nN=102", "211-240\nmin\nN=83",
               "241-270\nmin\nN=90",  "271+\nmin\nN=76")

plot_cutoffs(dataset=OF_THC, 
             timepoint_use_variable=OF$timepoint_use, 
             tissue="Oral Fluid", 
             cpd="THC", 
             x_labels=NULL)
## [[1]]

## 
## [[2]]
## # A tibble: 40 × 18
##       TP    FN    FP    TN detection_limit compound time_start time_stop
##    <dbl> <dbl> <int> <int> <fct>           <chr>         <dbl>     <dbl>
##  1     0     0    37   155 0.1             THC            -400         0
##  2     0     0    37   155 0.25            THC            -400         0
##  3     0     0    35   157 0.5             THC            -400         0
##  4     0     0    20   172 1               THC            -400         0
##  5     0     0    12   180 1.5             THC            -400         0
##  6   129     0    44    19 0.1             THC               0        30
##  7   129     0    44    19 0.25            THC               0        30
##  8   129     0    39    24 0.5             THC               0        30
##  9   129     0    30    33 1               THC               0        30
## 10   128     1    23    40 1.5             THC               0        30
## # ℹ 30 more rows
## # ℹ 10 more variables: time_window <fct>, NAs <int>, N <int>, N_removed <int>,
## #   Sensitivity <dbl>, Specificity <dbl>, PPV <dbl>, NPV <dbl>,
## #   Efficiency <dbl>, my_label <fct>

they all look pretty promising… we need a way to quantify this. I am gonna calculate the sensitivity and specificity for cutoff values in between 0 and 2.

output_OF = sens_spec_cpd_OFTHC(
                       dataset = OF,
                       cpd = "thc",
                       timepoints = timepoints_OF
                     )  |> clean_gluc()

output_OF_avg = average_sens_spec(output = output_OF)

output_OF_avg
## # A tibble: 101 × 4
##    compound detection_limit average_sensitivity average_specificity
##    <chr>              <dbl>               <dbl>               <dbl>
##  1 THC                 0                  0.956               0    
##  2 THC                 0.02               0.956               0.817
##  3 THC                 0.04               0.956               0.817
##  4 THC                 0.06               0.956               0.817
##  5 THC                 0.08               0.956               0.817
##  6 THC                 0.1                0.956               0.817
##  7 THC                 0.12               0.956               0.817
##  8 THC                 0.14               0.956               0.817
##  9 THC                 0.16               0.956               0.817
## 10 THC                 0.18               0.956               0.817
## # ℹ 91 more rows

lets plot this really quick

ss_OF_avg_together <-
  ss_plot_avg_together(output_OF_avg, tpts = length(unique(output_WB$time_start)), tissue = "Oral Fluid")

oh we found it. the place where they intersect is the maximum of sensitivity+specificity. lets get the specific value

output_OF_avg |>
  mutate(diff = abs(average_sensitivity-average_specificity)) |>
  arrange(diff)
## # A tibble: 101 × 5
##    compound detection_limit average_sensitivity average_specificity    diff
##    <chr>              <dbl>               <dbl>               <dbl>   <dbl>
##  1 THC                 0.82               0.893               0.890 0.00338
##  2 THC                 0.84               0.893               0.890 0.00338
##  3 THC                 0.86               0.893               0.890 0.00338
##  4 THC                 0.88               0.893               0.890 0.00338
##  5 THC                 0.9                0.893               0.890 0.00338
##  6 THC                 0.7                0.903               0.879 0.0240 
##  7 THC                 0.72               0.903               0.879 0.0240 
##  8 THC                 0.74               0.903               0.879 0.0240 
##  9 THC                 0.76               0.903               0.879 0.0240 
## 10 THC                 0.78               0.903               0.879 0.0240 
## # ℹ 91 more rows

at cutoff 0.82-0.90, the difference between the average sensitivity and average specificity is minimized. we’ll pick 0.85 for aestheicism’s sake.

Results & Discussion

During our thorough exploratory data analysis, a discernible pattern emerged, highlighting the potency of the THC compound in effectively indicating recent marijuana usage. Consequently, our focus in the data analysis section specifically hones in on the sensitivity and specificity cutoff measurements within the Blood and Oral Fluid tissues. Notably, Oral Fluid consistently exhibits superior sensitivity across all timepoints. By scrutinizing the Receiver Operating Characteristic (ROC) curve comparing the THC measurements in Blood and Oral Fluid, a clear trend emerges – Oral Fluid surpasses Blood in accuracy. In simpler terms, Oral Fluid proves more adept at detecting recent marijuana joint usage compared to its Blood counterpart. This confirmation underscores the THC measurement in Oral Fluid as the paramount biomarker for recent use.

Moving forward, our focus shifts to determining the optimal cutoff values for both sensitivity and specificity. To achieve this, we calculate the average sensitivity and specificity across all time windows for various detection limits, identifying the point at which these metrics intersect – a key indicator of the optimal cutoff. Upon plotting the graph and meticulously examining the associated table, a convergence becomes evident at detection limits ranging from 0.82 to 0.90. Hence, it becomes apparent that the 0.82 to 0.90 detection limit of the THC compound in Oral Fluid stands out as the most effective biomarker for recent use.

Moreover, our exploration extends to a broader question. The visualizations shed light on the distinct responses of various compounds to marijuana dosage. Notably, we focus on key comparisons: CBG with THC in Blood, CBN with CBG in Oral Fluid, and CBG with THCV in Oral Fluid. The analysis of CBG with THC in Blood reveals a noteworthy observation – the coefficient of THC in the low dose group significantly exceeds that in the high dose group. This implies that, for the low dose group, each increment in THC correlates more strongly with CBG compared to the high dose group. Shifting attention to the comparison between CBN with CBG in Oral Fluid, the coefficient of CBG in the high dose group is notably higher. This suggests that as the dosage increases, a heightened correlation emerges between CBG and CBN. These nuanced insights deepen our understanding of compound interactions in response to varying marijuana doses.

The results of our case study on biomarkers of recent marijuana use reveal intriguing insights into the choice of compound, matrix, and cutoff for effective detection. The overarching goal was to identify the most potent biomarker that, when combined with a specific matrix and cutoff, can reliably indicate recent marijuana use within a three-hour timeframe.

Our analysis consistently points to THC in oral fluid as the optimal biomarker for recent marijuana use. The scatterplots depicting compound measurements over time clearly show a distinct separation between the placebo and THC treatment groups, particularly in oral fluid. The ROC curve analysis further supports this conclusion, with oral fluid THC demonstrating superior sensitivity across all time points compared to other matrices. The enhanced sensitivity in oral fluid makes it a compelling choice for detecting recent marijuana use accurately.

To determine the most effective cutoff for sensitivity and specificity, we conducted a thorough analysis, revealing that a cutoff value between 0.82 and 0.90 maximizes the balance between sensitivity and specificity. We opted for 0.85 as the optimal cutoff, considering the intersection point where the difference between average sensitivity and specificity is minimized. This ensures a practical and balanced approach for identifying recent marijuana use.

Limitations & Suggestions

there are some other possible ways of going about this… for example, someone who is quite recently high should be obviously high to an officer, so it is viable to give the samples in the 0-31min time window less weight when considering overall sensitivity and specificity. my reasoning for not doing is because any sort of “weight” would be quite arbitrary and meaningless.

Conclusion

the optimal biomarker is OF,THC, at cutoff = 0.85.

Extended Question

Extended Question: wrangling

WB_long = WB |>
  pivot_longer(6:13, names_to = "compound")

OF_long = OF |>
  pivot_longer(6:12, names_to = "compound")

BR_long <- BR |> pivot_longer(6)

df_full <- bind_rows(WB_long, OF_long, BR_long)

The Pairplot That Started it All

OF = OF_ex
compounds_OF <-  as.list(colnames(Filter(function(x) !all(is.na(x)), OF[6:12])))

pairs(OF[,unlist(compounds_OF)], 
      pch=19, 
      cex=0.4, 
      cex.labels=0.6,
      labels=gsub('GLUC','gluc',gsub("_","-",toupper(colnames(OF[,unlist(compounds_OF)])))))

This is a pair plot showing the correlation between different compounds in th OF matrix. Notice how some of them look like two separate lines. This should be a clear sign that there’s some underlying variable that effects both of the compounds. After some trial and error, we found that this observation is due to the treatment variable - specifically, low dose and high dose group seems to have different slopes when it comes to the correlation between certain compounds.

Here’s a few of the more obvious ones:

OF = OF_ex |>
  filter(treatment != "Placebo")

  ggplot(data = OF,
       mapping = aes(y = thc, x = cbn, color = treatment)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x)

        ggplot(data = OF,
       mapping = aes(y = thc, x = cbg, color = treatment)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x)

Also, we are removing the placebo group since it is not of interest.

This seems to show that low dose/high dose they change the correlation between chemicals. Interesting. For this report, we will narrow our target of investigation to THC vs CBG in OF.

If there is, in fact, some significant difference between compound correlation between different treatment, we should be able to construct some sort of model that, given a sample of the measurement of those two compounds, would be able to predict which treatment group this sample is pulled from to some meaningful degree of accuracy.

Let’s do some analysis.

Extended Question: EDA

First we try fitting a single variable linear model with THC as the value, and CBG as the explanatory variable.

lm_cbg = linear_reg() |>
  set_engine("lm") |>
  fit(thc ~ cbg, data = OF) 

lm_cbg
## parsnip model object
## 
## Fit time:  4ms 
## 
## Call:
## stats::lm(formula = thc ~ cbg, data = data)
## 
## Coefficients:
## (Intercept)          cbg  
##       60.30        14.35
glance(lm_cbg)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.850         0.849  446.     3581. 5.65e-263     1 -4782. 9570. 9583.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Here the resulting linear model is THC = 14.41CBG + 39.8 with an R^2 of 0.85.

Now we test if the addition of treatment significantly improves the model by doing multiple linear regression:

#multiple linear regression model with only main effect
mlm_cbg1 = linear_reg() |>
  set_engine("lm") |>
  fit(thc ~ cbg + treatment, data = OF) 

#multiple linear regression model with main effect + interaction effect
mlm_cbg2 = linear_reg() |>
  set_engine("lm") |>
  fit(thc ~ cbg * treatment, data = OF) 

mlm_cbg1
## parsnip model object
## 
## Fit time:  4ms 
## 
## Call:
## stats::lm(formula = thc ~ cbg + treatment, data = data)
## 
## Coefficients:
##                    (Intercept)                             cbg  
##                         -16.64                           14.45  
## treatment13.4% THC (high dose)  
##                         156.34
glance(mlm_cbg1)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.854         0.854  440.     1853. 2.41e-265     2 -4772. 9552. 9570.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
mlm_cbg2
## parsnip model object
## 
## Fit time:  2ms 
## 
## Call:
## stats::lm(formula = thc ~ cbg * treatment, data = data)
## 
## Coefficients:
##                        (Intercept)                                 cbg  
##                            48.4663                             11.6821  
##     treatment13.4% THC (high dose)  cbg:treatment13.4% THC (high dose)  
##                             0.5256                             12.9414
glance(mlm_cbg2)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.969         0.969  202.     6652.       0     3 -4277. 8563. 8585.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Adding in treatment as another explanatory variable and including interaction effects produces a noticeably more accurate model - with a R^2 of ~0.97.

Just to be thorough, let’s try another way. Here we fit a linear model for high does and low dose group separately.

OF_low <- OF |>
  filter(treatment == "5.9% THC (low dose)")

OF_high <- OF |>
  filter(treatment == "13.4% THC (high dose)")

#linear regression `cbn ~ cbg` for 5.9% THC (low dose) group
lr_cbg_low = linear_reg() |>
  set_engine("lm") |>
  fit(cbn ~ cbg, data = OF_low)

glance(lr_cbg_low)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.945         0.945  34.7     5641. 1.02e-208     1 -1638. 3282. 3293.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
#linear regression `cbn ~ cbg` for 13.4% THC (high dose) group
lr_cbg_high = linear_reg() |>
  set_engine("lm") |>
  fit(cbn ~ cbg, data = OF_high) 

glance(lr_cbg_high)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.913         0.913  67.8     3191. 3.01e-163     1 -1724. 3453. 3464.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Extended Question: Results & Discussion

We attempted four different ways of modeling the relationship between CBG and THC: directly fitting a linear model with CBG as the only explanatory variable, fitting a multiple linear model with both CBG and treatment as explanatory variables but only with main effect, multiple linear model with both CBG and treatment as explanatory variables with main effect and interaction effect, and finally fitting a linear model for the low dose and high dose group separately. Out of all these models, the multiple linear model with two explanatory variables and interaction effect proves to be the best model, explaining almost 97% of all observed data. This is quite impressive, especially considering that even the linear models that are fitted onto high dose and low dose group separately did not achieve this level of accuracy.

This result has both practical implications and potential to serve as grounf for further research. On the one hand, being able to accurate predict the value of another measurement based on one measurement and information about the dosage of THC can significantly lower the cost, time, and energy required for data collection. On the other hand, this predictive power of dosage in terms of how certain compounds relate to each other seem to suggest a non-linear relationship between dosage and certain compound’s value. Further research is required to investigate this proposition.


    1. Marijuana at Work: What Employers Need to Know. NSC. https://www.nsc.org/nsc-membership/marijuana-at-work#:~:text=According%20to%20a%20study%20reported,Decreased%20productivity
    ↩︎
  1. Research Note: Results of the 2013-2014 National Roadside Survey of Alcohol and Drug Use by Drivers. NHTSA. https://www.nhtsa.gov/sites/nhtsa.gov/files/812118-roadside_survey_2014.pdf↩︎

  2. Drug-Impaired Driving. NHTSA. https://www.nhtsa.gov/risky-driving/drug-impaired-driving#:~:text=In%202007%2C%20NHTSA’s%20National%20Roadside,in%20less%20than%2010%20years.↩︎